Skip to content

Fix high response latency with multiple API-requests targeting non-existent hosts or services#10883

Merged
jschmidt-icinga merged 4 commits into
masterfrom
better-json-error-diagnostic-information
Jul 2, 2026
Merged

Fix high response latency with multiple API-requests targeting non-existent hosts or services#10883
jschmidt-icinga merged 4 commits into
masterfrom
better-json-error-diagnostic-information

Conversation

@jschmidt-icinga

@jschmidt-icinga jschmidt-icinga commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

This fixes the high response latency observed in #10875 when multiple requests target non-existent hosts in parallel. The reason is that for almost all API endpoints that target config objects diagnostic information including a stack trace is generated, regardless if it is actually returned (which only happens when the "verbose" parameter is given). The solution is to pass the exception_ptr into SendJsonError() and only generate the DiagnosticInformation from it when needed.

Solution

SendJsonError() now takes a std::exception_ptr argument instead of a dagnosticInformation string that is generated at the call site. This is cheap and doesn't cause any copies. Then, where the actual decision is made to include the diagnostic_information (i.e. dependent on the "verbose" parameter), DiagnosticInformation(<eptr>) is run. The overload already existed, so there wasn't even a need to rethrow the exception and process the string from the result.

This is a clean solution, and the change makes sense even without the latency amplification in mind, since pointless work is avoided in any case.

However, some yak-shaving was required to feed SendJsonError() with an exception pointer gathered with std::current_exception() at the call-site. I had to make the entire code-base use std::exception_ptr instead of boost::exception_ptr. On the newest boost version, this is just a typedef, on previous versions the boost one could be constructed from the std one, but there's also old versions we still support which didn't have that conversion. And that refactor in turn uncovered a small bug in the DiagnosticInformation(<eptr>) which had a dead-code fallback that then failed to compile and is now fixed.

Testing

Using the test-script that was provided in #10875, the improvement is clearly observable:

Before

With existent host/service:

$ ulimit -n 100000 && python3 wp.py --host https://localhost:5665 --user root --password icinga --total 5000 --parallel 200 --target-host master-1 --target-service procs --no-verify-ssl
Sending 5000 requests to https://localhost:5665
Max in-flight: 200
Press Ctrl+C to stop.

Progress: 3097/5000 sent (3097 ok, 0 errors) (host: master-1) (service: procs) — min: 0.001s, avg: 0.312s, p95: 1.758s, max: 2.564s

Total test duration: 7.200s
Done: 5000 ok, 0 errors out of 5000 total requests.
Request duration — min: 0.001s, max: 2.564s, avg: 0.282s, p95: 0.904s

With non-existent host/service:

$ ulimit -n 100000 && python3 wp.py --host https://localhost:5665 --user root --password icinga --total 5000 --parallel 200 --target-host master-1 --target-service procsasdf --no-verify-ssl
Sending 5000 requests to https://localhost:5665
Max in-flight: 200
Press Ctrl+C to stop.

Progress: 566/5000 sent (566 ok, 0 errors) (host: master-1) (service: procsasdf) — min: 0.116s, avg: 0.864s, p95: 2.527s, max: 4.867s
Progress: 1496/5000 sent (1496 ok, 0 errors) (host: master-1) (service: procsasdf) — min: 0.012s, avg: 1.046s, p95: 5.539s, max: 9.837s
Progress: 2419/5000 sent (2419 ok, 0 errors) (host: master-1) (service: procsasdf) — min: 0.012s, avg: 0.998s, p95: 4.867s, max: 13.447s
Progress: 3336/5000 sent (3336 ok, 0 errors) (host: master-1) (service: procsasdf) — min: 0.012s, avg: 1.015s, p95: 4.797s, max: 18.046s
Progress: 4198/5000 sent (4198 ok, 0 errors) (host: master-1) (service: procsasdf) — min: 0.012s, avg: 1.075s, p95: 5.001s, max: 22.430s

Total test duration: 29.636s
Done: 5000 ok, 0 errors out of 5000 total requests.
Request duration — min: 0.012s, max: 22.430s, avg: 1.160s, p95: 5.629s
jschmidt@ws-jschmidt:~/Projects/dev-docker$ ulimit -n 100000 && python3 wp.py --host https://localhost:5665 --user root --password icinga --total 5000 --parallel 200 --target-host master-1 --target-service procsasdf --no-verify-ssl

After

With existent host/service:

$ ulimit -n 100000 && python3 wp.py --host https://localhost:5665 --user root --password icinga --total 5000 --parallel 200 --target-host master-1 --target-service procs --no-verify-ssl
Sending 5000 requests to https://localhost:5665
Max in-flight: 200
Press Ctrl+C to stop.

Progress: 3161/5000 sent (3161 ok, 0 errors) (host: master-1) (service: procs) — min: 0.002s, avg: 0.304s, p95: 1.796s, max: 2.792s

Total test duration: 6.844s
Done: 5000 ok, 0 errors out of 5000 total requests.
Request duration — min: 0.002s, max: 2.792s, avg: 0.269s, p95: 0.799s

With non-existent host/service:

$ ulimit -n 100000 && python3 wp.py --host https://localhost:5665 --user root --password icinga --total 5000 --parallel 200 --target-host master-1 --target-service procsasdf --no-verify-ssl
Sending 5000 requests to https://localhost:5665
Sending 5000 requests to https://localhost:5665
Max in-flight: 200
Press Ctrl+C to stop.


Total test duration: 2.545s
Done: 5000 ok, 0 errors out of 5000 total requests.
Request duration — min: 0.002s, max: 2.126s, avg: 0.096s, p95: 0.212s
Sending: command not found

You can see that the error responses for non-existant targets are now actually much faster than successful requests, as one would expect.

Fixes #10875.

@cla-bot cla-bot Bot added the cla/signed label Jun 17, 2026
@jschmidt-icinga jschmidt-icinga force-pushed the better-json-error-diagnostic-information branch from f83e8c7 to c14e5de Compare June 17, 2026 09:07
@jschmidt-icinga jschmidt-icinga added this to the 2.17.0 milestone Jun 17, 2026
@jschmidt-icinga jschmidt-icinga added consider backporting Should be considered for inclusion in a bugfix release bug Something isn't working labels Jun 17, 2026
@jschmidt-icinga jschmidt-icinga force-pushed the better-json-error-diagnostic-information branch 2 times, most recently from 83f42f4 to 1c40368 Compare June 18, 2026 07:45
@jschmidt-icinga jschmidt-icinga added backport-to-support/2.16 PRs with this label will automatically be backported to the v2.16 support branch. and removed consider backporting Should be considered for inclusion in a bugfix release labels Jul 1, 2026

@julianbrost julianbrost left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have a commit titled "Consistently use std::exception_ptr", but I noticed two (indirect) uses of boost::exception_ptr through calls to boost::current_exception() in the code base.

} catch (const std::exception& ex) {
BOOST_THROW_EXCEPTION(ScriptError("Error while evaluating expression: " + String(ex.what()), GetDebugInfo())
<< boost::errinfo_nested_exception(boost::current_exception()));
}

That one is probably fine given that boost::errinfo_nested_exception lives within the boost exception universe even though the documentation leaves the exact exception pointer type unspecified.

However, there's a second instance where we can pretty sure get rid of the implicit boost::exception_ptr:

} catch (const ScriptError&) {
/* Re-throw the exception for the outside try-catch block. */
boost::rethrow_exception(boost::current_exception());
} catch (const std::exception& ex) {

Not sure why this isn't just throw;.

Apart from that, this is looking fine. Spun it up for a test, the speedup is amazing. This is so stupid overall 🤣

@jschmidt-icinga

Copy link
Copy Markdown
Contributor Author

} catch (const std::exception& ex) {
BOOST_THROW_EXCEPTION(ScriptError("Error while evaluating expression: " + String(ex.what()), GetDebugInfo())
<< boost::errinfo_nested_exception(boost::current_exception()));
}

As I noted in the PR description boost::exception_ptr implementations can be divided into three groups. Those that just alias to std::exception_ptr, those that can construct from it, and those that can't. So sadly, in this case if we'd really want to use std::exception_ptr, we'd have to define a new boost::error_info specialization for it.

Not sure why this isn't just throw;.

No idea. I'll add it in a separate commit.

@jschmidt-icinga jschmidt-icinga force-pushed the better-json-error-diagnostic-information branch from 1c40368 to f2b971e Compare July 2, 2026 11:37
This was likely meant as a fallback for non-`std::exception`
exception pointers, but it would never have been reached, because
`(std|boost)::rethrow_exception()` would have thrown those out of
the function scope.

This fixes the fallback by making it a catch handler and also using
the more direct way of getting this fallback diagnostic info inside
a catch handler.
`DiagnosticInformation()` wasn't able to take a `std::exception_ptr` due
to the missing conversion on older boost versions, so now everything uses
the std::exception_ptr instead. There are still a few reasons to use
`boost::exception` in some places, but for exception pointers, the standard
one should be better in most cases and almost never requires to include an
extra header.
The `diagnostic_information` string will now generated if and
when it is required (i.e. the "verbose" parameter is true).
@jschmidt-icinga jschmidt-icinga force-pushed the better-json-error-diagnostic-information branch from f2b971e to 171e9ee Compare July 2, 2026 12:19
@jschmidt-icinga jschmidt-icinga enabled auto-merge July 2, 2026 14:31
@jschmidt-icinga jschmidt-icinga merged commit e0a138a into master Jul 2, 2026
30 checks passed
@jschmidt-icinga jschmidt-icinga deleted the better-json-error-diagnostic-information branch July 2, 2026 15:41
@backbot-ci

backbot-ci Bot commented Jul 2, 2026

Copy link
Copy Markdown

Backport failed for support/2.16, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin support/2.16
git worktree add -d .worktree/backport-10883-to-support/2.16 origin/support/2.16
cd .worktree/backport-10883-to-support/2.16
git switch --create backport-10883-to-support/2.16
git cherry-pick -x aa6c63b52eed9efcef03deb0193e574f3cb24e62 a9783fdde45a73fda40cc4c19fde3a64de6edae2 393724bb2a1c63e718a2f06b369f102258c9b98b 171e9ee37a1db3b13843f9001b162a420dd21d4a

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-to-support/2.16 PRs with this label will automatically be backported to the v2.16 support branch. bug Something isn't working cla/signed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parallel process-check-result requests for non-existing objects degrade Icinga 2 performance and cause overdue checks

2 participants